1. Figure out data

1-1. What is ZooScore?

The ZooScore dataset compiles ZooScores determined for a variety of pathogens and parasites collected from the Global Mammal Parasite Database (GMPD). The image below shows the decision tree used to calculate a ZooScore, which ranges from -1 (a pathogen not found in humans) to 3 (a pathogen capable of human-to-human transmission, e.g., SARS-CoV-2).

Scores range from -1 to 3, with 1 to 3 indicating zoonotic potential.

1-2. Importance of Exploring ZooScore

  • Reveal the unexplored.
  • Enhance understanding of zoonotic diseases and their origins.
  • Highlight candidates for potential investigation.

1-3. Basic Information

The first step I took was to thoroughly understand the dataset by creating various visuals. These included simple bar graphs to display counts, box plots to visualize distributions, and summary statistics to gain insights into the dataset’s overall characteristics.

# Overview of column types and missing values
visdat::vis_dat(ZOO) + coord_flip() + scale_fill_viridis_d() +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

The dataset has 28 columns and 2008 rows. Each column is a variable describing a parasite or the ZooScore that investigators assigned to it, and each row represents one parasite. Since parasite_corrected_name serves as the index, the number of rows should equal the number of unique parasite_corrected_name values. To verify this, I counted the distinct values of parasite_corrected_name.
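A minimal sketch of this uniqueness check in base R is shown here. The data frame zoo_toy and its placeholder parasite names are hypothetical stand-ins for ZOO; only the column name parasite_corrected_name comes from the dataset.

```r
# Hypothetical stand-in for ZOO (the real data has 2008 rows)
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite a", "parasite b", "parasite c"),
  stringsAsFactors = FALSE
)

n_rows   <- nrow(zoo_toy)
n_unique <- length(unique(zoo_toy$parasite_corrected_name))

# TRUE when every row has its own parasite name, i.e. the column works as an index
is_index <- n_rows == n_unique

# Names that occur more than once (empty when the column is a valid index)
dup_names <- unique(
  zoo_toy$parasite_corrected_name[duplicated(zoo_toy$parasite_corrected_name)]
)
```

If is_index is FALSE, dup_names lists the entries to inspect.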

Several variables have a high proportion of missing values, in particular insect, commensal, xc_notes, pgf_zoo_score, pgf_c_score, pgf_notes, notes, print_ref, xc_citation, pgf_citation, pgf_more_citations, and nematode.
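One way to put numbers on the missingness is to compute the share of NA values per column. The sketch below uses a hypothetical toy data frame with invented values, and the 50% cut-off is an arbitrary illustrative threshold, not one taken from the analysis.

```r
# Hypothetical stand-in for ZOO; all values are invented for illustration
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite a", "parasite b", "parasite c", "parasite d"),
  xc_zoo_score = c(1, 3, -1, 2),
  pgf_notes = c(NA, NA, NA, "checked"),
  stringsAsFactors = FALSE
)

# Share of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(zoo_toy)), decreasing = TRUE)

# Columns whose missing share exceeds the (arbitrary) 50% cut-off
mostly_missing <- names(na_share)[na_share > 0.5]
```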

1-4. Data quality

Strength: A broad spectrum of over 2,000 pathogens

Limitation: Lack of complete biological context for pathogens

2. Contextualize

As I delved deeper into the data, I recognized the need to enhance its context. To achieve this, I merged the ZooScore dataset with several related sources. For instance, I connected pathogens, species, and diseases using the Gideon Pathogens-Species-Disease dataset. Additionally, the Gideon Disease Traits dataset provided valuable insights. To better understand the animal groups, I utilized the Mammal Taxonomy Dictionary dataset. To visualize geographical distribution, I incorporated the Natural Earth dataset.
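A merge of this kind can be sketched as a left join followed by a check for names that found no partner. The tables gid_toy and zoo_toy are hypothetical stand-ins with invented values; the column names species, ParasiteGMPD, parasite_corrected_name, and xc_zoo_score are borrowed from the code in this document. Base R merge() is used here to keep the sketch dependency-free; it plays the same role as a dplyr left_join().

```r
# Hypothetical stand-ins for a host-parasite table and ZOO (invented values)
gid_toy <- data.frame(
  species      = c("host a", "host b", "host c"),
  ParasiteGMPD = c("parasite a", "parasite a", "parasite x"),
  stringsAsFactors = FALSE
)
zoo_toy <- data.frame(
  parasite_corrected_name = c("parasite a", "parasite b"),
  xc_zoo_score            = c(2, -1),
  stringsAsFactors = FALSE
)

# Left join: keep every host-parasite row, attach the ZooScore where the name matches
merged <- merge(gid_toy, zoo_toy,
                by.x = "ParasiteGMPD", by.y = "parasite_corrected_name",
                all.x = TRUE)

# Parasite names without a match in zoo_toy
# (assumes xc_zoo_score is never missing in zoo_toy itself)
unmatched <- unique(merged$ParasiteGMPD[is.na(merged$xc_zoo_score)])
```

Unmatched names are worth inspecting for spelling or formatting differences between sources before drawing conclusions from the merged table.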

2-1. Meaningful Variables

xc_c_score & xc_zoo_score

The xc_c_score represents the cross-checked confidence score after review by multiple individuals. It expresses the confidence level in the ZooScore, with 1 indicating high confidence and 3 indicating low or no confidence. The xc_c_score column is more complete than confidence_score, as it has fewer missing (NA) values. All values fall within the expected range.
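Both claims, fewer missing values and values within the expected 1 to 3 range, can be checked with a few lines of base R. The toy data frame below is a hypothetical stand-in with invented values; only the column names come from the dataset.

```r
# Hypothetical stand-in for the two confidence columns in ZOO (invented values)
zoo_toy <- data.frame(
  confidence_score = c(1, NA, NA, 3, 2),
  xc_c_score       = c(1, 2, NA, 3, 2)
)

# Number of missing values in each confidence column
na_counts <- colSums(is.na(zoo_toy[, c("confidence_score", "xc_c_score")]))

# TRUE when every non-missing cross-checked score lies in the expected range 1..3
in_range <- all(zoo_toy$xc_c_score %in% 1:3 | is.na(zoo_toy$xc_c_score))
```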

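# Count distinct parasites for each ZooScore / confidence score combination
ZOO %>%
  group_by(xc_zoo_score, xc_c_score) %>%
  summarise(n_row = n_distinct(parasite_corrected_name, na.rm = TRUE),
            .groups = "drop") %>%
  ggplot(aes(x = xc_zoo_score,
             y = xc_c_score)) +
  # linewidth replaces size for tile borders in ggplot2 >= 3.4.0
  geom_tile(aes(fill = n_row), color = "black",
            linewidth = 0.6) +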
  geom_label(mapping = aes(label = n_row,
                          color = n_row > median(n_row)),
            size = 2.5)+
  scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
                                                "FALSE" = '#091F40'))+
  scale_fill_continuous()+
  theme_bw()+
  scale_x_continuous(breaks = -2:3,
                     expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0),
                     breaks = 1:3) +
  labs(x = "ZooScore",
       y = "Confidence Score",
       fill = "Num") +
  theme(aspect.ratio = 0.3) +
  theme(axis.text.x = element_text(angle = 0, hjust=1))+
  guides(fill = guide_colorbar(ticks = TRUE,
                               ticks.colour = "black",
                               ticks.linewidth = 1,
                               frame.colour = "black",
                               frame.linewidth = 1,
                               barwidth = 1,
                               barheight = 7))

2-2. Species Richness in Zoonotic Pathogens

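# Split the comma-separated parasite names in GID, trim whitespace,
# and count how many GID records list each parasite
data.frame(table(trimws(unlist(strsplit(GID$ParasiteGMPD, ","))))) %>%
  left_join(selected, by = c("Var1" = "parasite_corrected_name")) %>%
  group_by(Var1) %>%
  # first() keeps one row per parasite even if the join duplicated it
  summarise(species_richness = first(Freq),
            xc_zoo_score = mean(xc_zoo_score)) %>%
  filter(!is.na(xc_zoo_score)) %>%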
  ggplot() +
  geom_jitter(aes(x = xc_zoo_score,
                  y = species_richness,
                  color = as.factor(xc_zoo_score)),
              width = 0.3, 
              show.legend = FALSE) +
  geom_boxplot(aes(group = xc_zoo_score, 
                   y = species_richness,
                   x = xc_zoo_score,
                   color = as.factor(xc_zoo_score)),
               alpha = 0.3, 
               outlier.alpha = 0, 
               show.legend = FALSE,
               width = 0.4) +
  labs(x = "ZooScore",
       y = "Species richness") +
  scale_color_brewer(palette = "Dark2") +
  scale_x_continuous(breaks = 1:3,
                     expand = c(0.1, 0.1)) +
  scale_y_continuous(labels = scales::comma) +
  theme(panel.background = element_rect(fill = "white",
                                        color = "black"),
        panel.grid.major = element_line(color = "grey80"),
        aspect.ratio = 0.8)
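The first step of the pipeline above, turning a column of comma-separated parasite names into per-parasite counts, is easier to follow on a small example. The input vector below is hypothetical and uses placeholder names.

```r
# Hypothetical comma-separated entries, with inconsistent spacing after commas
parasite_col <- c("parasite a, parasite b",
                  "parasite a",
                  "parasite c,parasite a")

# Split on commas, flatten to one vector, and strip surrounding whitespace
parasite_names <- trimws(unlist(strsplit(parasite_col, ",")))

# One row per parasite with the number of entries that mention it
counts <- as.data.frame(table(parasite_names), stringsAsFactors = FALSE)
```

Without the trimming step, "parasite b" and " parasite b" would be counted as two different names.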

2-3. Zoonotic Pathogens Search Hits

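library(patchwork)
# patchwork's `+` places the two plots side by side (assuming google and wofs are ggplot objects created earlier)
google + wofs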

2-4. Species Richness by Order and ZooScore

p_all <- GID %>%
  left_join(MDD[, c("species", "order")], by = "species") %>%
  left_join(ZOO, by = c("ParasiteGMPD" = "parasite_corrected_name")) %>%
  # xc_zoo_score is a grouping variable, so summarise() carries it through unchanged
  group_by(order, xc_zoo_score) %>%
  summarise(species_richness = n_distinct(species), .groups = "drop") %>%
  filter(!is.na(xc_zoo_score),
         !is.na(order)) %>%
  mutate(order = tools::toTitleCase(tolower(order))) %>%
  ggplot(aes(x =xc_zoo_score,
             y = order))+
  geom_tile(aes(fill = species_richness), color = "black",
            linewidth = 0.6) +
  geom_label(mapping = aes(label = species_richness,
                          color = species_richness > mean(species_richness)),size = 3)+
  scale_color_manual(guide = 'none', values = c('TRUE' = '#D31245',
                                                "FALSE" = '#091F40'))+
  scale_fill_continuous()+
  theme_bw()+
  scale_x_continuous(breaks = -2:3,
                     expand = c(0, 0)) +
  # scale_y_continuous(expand = c(0, 0),
  #                    breaks = 1:3) +
  labs(x = "ZooScore",
       y = "Order",
       fill = "Species richness") +
  # theme(aspect.ratio = 0.5) +
  theme(axis.text.x = element_text(angle = 0, hjust=1))+
  guides(fill = guide_colorbar(ticks = TRUE,
                               ticks.colour = "black",
                               ticks.linewidth = 1,
                               frame.colour = "black",
                               frame.linewidth = 1,
                               barwidth = 1,
                               barheight = 7))
p_all

3. Area of Interest

Through exploration and contextualization, I pinpointed specific areas of interest. I was particularly drawn to pathogens with higher ZooScores, since these indicate greater zoonotic potential. I also identified animal groups with a substantial number of species, such as Rodentia, Carnivora, and Artiodactyla. These areas became the foundation for my subsequent analyses.
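Ranking orders by the number of distinct species can be sketched as follows. The table mdd_toy is a hypothetical stand-in for the taxonomy data, and its species names and counts are invented purely to illustrate the ranking step.

```r
# Hypothetical stand-in for the taxonomy table; species and counts are invented
mdd_toy <- data.frame(
  species = paste("species", 1:10),
  order   = rep(c("RODENTIA", "CARNIVORA", "ARTIODACTYLA", "PRIMATES"),
                times = c(4, 3, 2, 1)),
  stringsAsFactors = FALSE
)

# Number of distinct species per order, as a named integer vector
richness_by_order <- vapply(
  split(mdd_toy$species, mdd_toy$order),
  function(x) length(unique(x)),
  integer(1)
)

# The three orders with the most species, largest first
top3_orders <- names(sort(richness_by_order, decreasing = TRUE))[1:3]
```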

3-1. Top 3 Orders

# Top 3 orders
p_rodent + theme(axis.title.x = element_blank()) +
  p_carnivora + theme(axis.title.x = element_text(face = "bold"), axis.title.y = element_blank()) +
  p_artiodactyla + theme(axis.title = element_blank())

4. Subset data

To narrow my focus, I decided to work with a subset of the data. I directed my attention towards specific pathogens that aligned with my areas of interest: Toxoplasma gondii, Borrelia burgdorferi, and Hymenolepis diminuta. By homing in on these pathogens, I could dive deeper into their associated attributes.
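The subsetting step can be sketched in base R as a filter on the parasite name. The data frame zoo_toy is a hypothetical stand-in for ZOO, and the sketch assumes the names are stored with exactly this spelling and capitalization; if the real column differs, both sides of the comparison would need the same normalization (for example tolower()).

```r
# The three pathogens of interest named above
target_pathogens <- c("Toxoplasma gondii",
                      "Borrelia burgdorferi",
                      "Hymenolepis diminuta")

# Hypothetical stand-in for ZOO with one placeholder row that should be dropped
zoo_toy <- data.frame(
  parasite_corrected_name = c("Toxoplasma gondii", "parasite b",
                              "Borrelia burgdorferi", "Hymenolepis diminuta"),
  stringsAsFactors = FALSE
)

# Keep only rows whose parasite name is one of the targets
zoo_subset <- zoo_toy[zoo_toy$parasite_corrected_name %in% target_pathogens, ,
                      drop = FALSE]
```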